[57] ABSTRACT

A VLIW processor has first and second functional units for 
executing first and second commands in a first instruction 
word. The first and second commands comprise a first field 
and a second field, respectively, in ordered concatenations of 
fields. The processor has a third functional unit for executing 
a third command in a second instruction word. The third 
command comprises both the first and second fields. 
VLIW PROCESSOR HAS DIFFERENT FUNCTIONAL UNITS OPERATING ON
COMMANDS OF DIFFERENT WIDTHS

FIELD OF THE INVENTION

The invention relates to a VLIW (Very Large Instruction Word)
processor. The invention also relates to a program for such a
processor and to compilation of such a program.

BACKGROUND ART

An example of a VLIW processor is the TM-1000 processor (TriMedia)
of Philips Electronics. This processor is described in, for example,
European patent application No. EP 605 927 (equivalent to U.S. Ser.
No. 07/999,080 now abandoned; PHA 21777). In a VLIW processor,
parallel execution of instructions is obtained by combining multiple
basic machine commands in a single long instruction word. Typically,
each such basic command represents a RISC operation. Per clock cycle,
a long instruction word is supplied to a parallel arrangement of
functional units that operate in lock-step. A respective one of the
commands is supplied to a relevant one of the units. Typically, a
unit performs pipelined execution.

The TM-1000 processor issues the commands in parallel, each in a
respective issue slot of the very long instruction word issue
register. Each issue slot is associated with a respective group of
functional units and with two read ports and one write port to the
register file. A particular command is directed to a specific one
among the functional units of the group that is associated with the
particular issue slot. The command typically comprises an opcode, two
source operand definitions and a result operand definition. The
source operand definitions and the result operand definition refer to
registers in the register file. During execution of the command, the
source operands are read from the particular issue slot by supplying
fetch signals to the read ports associated with the issue slot in
order to fetch the operands. Typically, the functional unit receives
the operands from these read ports, executes the command according to
the opcode and writes back a result into the register file via the
write port associated with the particular issue slot. Alternatively,
commands may use fewer than two operands and/or produce no result for
the register file.

A typical program for the VLIW processor is translated into a set of
commands for the functional units. A compile time scheduler
distributes these commands over the long instruction words. The
scheduler attempts to minimize the time needed to execute the program
by optimizing parallelism. The scheduler combines commands into
instruction words under the constraint that the commands assigned to
the same instruction can be executed in parallel and under data
dependency constraints.

SUMMARY OF THE INVENTION

It is one of the objects of the invention to provide a VLIW processor
whose architecture enables reducing the number of instruction words
required for execution of a program with respect to conventional VLIW
processors.

To this end, the VLIW processor of the invention comprises an
instruction issue port for sequentially supplying first and second
very long instruction words. Each respective one of the words
comprises a respective ordered concatenation of fields distributed
among a respective concatenation of commands. The first word
comprises a first one of the commands having a first one of the
fields, and a second one of the commands having a second one of the
fields. The second word comprises a third one of the commands having
the first and second ones of the fields. The processor has a first
functional unit coupled to the issue port for processing the first
command, and a second functional unit coupled to the issue port for
processing the second command in lock-step with the first functional
unit, and a third functional unit coupled to the issue port for
processing the third command.

The invention is based on a flexible division of the instruction
issue port into issue slots. This means that it is possible, for
example, to use different commands that use different numbers of
fields for operands. Also, different commands may have opcodes of
different sizes. Fields that are associated with specialized
resources of the circuitry processing the commands, here the
functional units, can be used in combination with functional units
from different groups, associated with different issue slots. This
allows the use of more complex commands than with an instruction
issue port divided into fixed issue slots. There is no need to
reserve more space in the issue slots for handling the most complex
command. In a conventional VLIW processor, such complex commands have
to be implemented using several less complex commands in a sequence
of multiple instruction words. Accordingly, the invention enables
execution of programs faster than is possible with a conventional
VLIW processor, because completion of the program requires fewer
instruction words.

For example, in a machine such as the TM-1000, an operation which
computes a result from three operands (for example, an averaging
operation) would require at least two successively executed commands
and therefore at least two instructions. By providing for a command
with fields for more than two operands, the operation can be executed
using one command. Moreover, because the fields are assigned flexibly
to commands, this can be realized without reserving more than two
fields for operands for all commands. In an embodiment of the
invention, the third command contains all the fields used for
execution of the first and second commands. So, for example, if first
and second fixed size issue slots are used to issue the first and
second command respectively, the third command may use a combination
of the first and second issue slots. This simplifies scheduling.
Different fields in each command, i.e., in each issue slot, may each
have a fixed functionality, such as representing a read address of an
operand register, or representing a write address of a result
register or representing an opcode. These different fields will be
associated with fixed parts of the instruction processing circuits,
like read ports or opcode decoders. In this case, the third
instruction can use twice the number of operands and/or produce twice
the number of results and/or use a double size opcode. When a
respective one of the issue slots is associated with a respective
group of functional units, a functional unit that executes the third
command belongs to both groups at the same time.

The invention also relates to a compiler for compiling programs for a
VLIW processor with flexible assignments of fields to instructions
and to programs having such a flexible assignment.

BRIEF DESCRIPTION OF THE DRAWING

These and other aspects of the invention are illustrated by way of
example in the accompanying drawing, wherein:
FIG. 1 is a block diagram of a VLIW processor according to the
invention;
FIGS. 2a-c are diagrams of instruction word formats;
FIG. 3 is a block diagram of a decode circuit for an issue slot;
FIG. 4 is a flow diagram for compiling a program; and
FIG. 5 shows an assignment of commands to instruction slots.

PREFERRED EMBODIMENTS

FIG. 1 is a diagram of a VLIW processor. The VLIW processor comprises
an instruction word memory system 10, a program counter 11, a decode
and launch circuit 12, functional units 14a-k and a multi-port
register file 16. Counter 11 is coupled to an address input of memory
system 10. Memory system 10 has an instruction issue register (not
shown) partitioned in issue slots (also called instruction bus). An
issue slot has a number of functionally parallel paths for routing
individual bits of the instruction word currently buffered in the
issue register. An output of the issue register is coupled to decode
and launch circuit 12. Outputs of circuit 12 are coupled to
functional units 14a-k and to multiport register file 16. Read ports
and write ports of register file 16 are coupled to functional units
14a-k. Memory system 10 supplies instruction words consecutively to
decode and launch circuit 12 under control of counter 11. Preferably,
system 10 uses instruction caching and/or prefetching but this is not
essential to the invention. System 10 may also perform decompress
operations on instruction words stored in a compressed format before
outputting them. Decode and launch circuit 12 receives the
instruction words. Circuit 12 treats the functionally parallel paths
from memory system 10 as a collection of fields, each field being
associated with one or more of the paths. A collection of fields
makes up a command. A collection of commands forms a single
instruction word.

FIG. 2a shows an instruction word in the first format, wherein the
word is divided into a number of slots 20a-e. Each respective one of
slots 20a-e corresponds to a respective command. Each slot has a
field for an opcode 22a-e, two fields for operands 24a-e, 26a-e
(expressing a reference to a relevant register in register file 16),
and one field for a target for a result 28a-e (expressing a reference
to a relevant register in register file 16).

FIGS. 2b, c show instruction words in a second format wherein
multiple slots have been combined into "superslots". For example,
superslot 29a combines the fields of conventional slots 20a and 20b,
and superslot 29b combines the fields of slots 20c-e. The superslot
format enables implementing commands that use more operands than can
be implemented with a single conventional slot. Generally, each
individual field 22a-e, 24a-e, 26a-e, 28a-e has either the same
function in superslots 29a and 29b as in a conventional slot 20a-e or
no function at all. For example, a specific one of fields 24a-e and
26a-e defines an operand in the first format as well as in the second
format, or it is not used at all. A specific one of fields 22a-e is
used for an opcode in the first format as well as in the second
format, or it is not used at all. A specific one of fields 28a-e is
used for a target in the first format as well as in the second format
or it is not used at all. Alternatively, fields 22a-e, 24a-e, 26a-e,
and 28a-e in the second format have purposes different from those in
the first format.

FIG. 3 is a diagram of the VLIW processor, with part of decode and
launch circuit 30, functional units 32a-c, and part of multiport
register file 34. The part of decode and launch circuit 30 comprises
an instruction decoder 300 coupled to functional units 32a-c. Units
32a-c are coupled to read ports 302a, 302b of the register file 34. A
result write unit 304 is coupled to a write port 342 of register file
34. The outputs of read ports 302a,b are coupled to inputs of
functional units 32a-c. Outputs of functional units 32a-c are coupled
to write port 342. Decode and launch circuit 12 contains circuits
like those of FIG. 3 for each issue slot.

Decode and launch circuit 12 processes information from fields 22a-e,
24a-e, 26a-e, 28a-e according to the function of each field. Each of
operand fields 24a-e, 26a-e is associated with a respective read port
of register file 16. Decode and launch circuit 12 uses the content of
fields 24a-e, 26a-e to address the associated read port. Similarly,
target fields 28a-e correspond to write ports of register file 16;
decode and launch circuit 12 uses the content of fields 28a-e to
address the associated write port. Each of opcode fields 22a-e is
supplied to a respective instruction decoder 300 in decode and launch
circuit 12, whereupon the decoded opcodes are supplied to selected
ones of functional units 14a-k. Typically, decode and launch circuit
12 uses pipelined operation, for example by initiating operand
fetching during decoding of the instruction word. Decoding and
fetching of a command from an instruction word is being performed
while a command from a previous instruction word is still being
executed, and while a result of a command from an even earlier
instruction word is being written to its target location. Because the
functions of fields 22a-e, 24a-e, 26a-e, and 28a-e are predefined and
independent of the format, operand fetching can start before
instruction decoding has been completed.

Multiple functional units 14a-k are organized in groups such as the
group in FIG. 3. Each group is associated with a respective issue
slot in an instruction word of the first format. When decode and
launch circuit 12 receives an instruction word of the first format,
circuit 12 determines for each slot which functional unit of the
group associated with that slot, if any, should execute the command
in the slot. That functional unit subsequently receives control
signals to execute the command. Thus, decode and launch circuit 12
will cause the functional units in a group to start executing one at
a time. Typically, functional units of the same type are present in
different groups. For example, each group comprises an ALU
(Arithmetic Logic Unit). This prevents bottlenecks caused by using no
more than one functional unit of each group at a time.

Some of functional units 14a-c do not belong to one group only. These
units are referred to as super functional units below. Each of super
units 14a-c is associated with two or more specific groups. This
means that each of super units 14a-c can use operands from the read
ports to register file 16 that are associated with these specific
groups. Also, super units 14a-c can use the write ports to register
file 16 that are associated with these specific groups. Commands for
super functional units 14a-c come from instruction words in the
second format. Such commands are located in superslots 29a and 29b.
Operand fields 24a,b and 26a,b in the superslot serve for fetching
operands from register file 16 for super units 14a-c. Each of fields
24a,b and 26a,b is associated with the same read port for all
instruction words, regardless of the instruction word's format. As a
result, fetching can start before the format is determined.
Similarly, target fields 28a,b serve to control the write ports to
register file 16. Each of these fields is associated with the same
write port for all instruction words, independent of the format.
When decode and launch circuit 12 causes a particular super unit to
start executing a command, circuit 12 will prevent any of the
functional units 14d-k from starting to execute a command in the
groups associated with that super unit. Decoded opcode fields 22a,b
in superslots 29a,b correspond to opcode fields 22a,b in the
conventional slots 20a,b
associated with those groups, and are used to control only
the relevant super unit 14a-c. One may use, for example,
only the opcode of one of these slots to control the super
unit, but one may also use a combine circuit to combine
opcodes of two or more of those slots. Thus, a larger number
of different operations can be defined for each of super units
14a-c. Super units 14a-c execute commands for implementing
operations which require more than two operands and/or
produce more than two results. Examples of such operations
are:

AV (R1,R2,R3 . . . ) Producing the average of three or
more operands R1,R2,R3 . . . ;
ME (R1,R2,R3 . . . ) Producing the median of three, five
or more operands R1,R2,R3 . . . ;
SO (R1,R2) Sorting two operands R1 and R2, the bigger
operand being placed in a result register and the smaller
in another result register;
TP (R1,R2,R3,R4 . . . ) Transposition of matrix with rows
R1,R2,R3,R4;
RT (R1,R2,R3) Rotation of a vector with components
R1,R2,R3 over a specified angle.
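
The first three operations in the list above can be given reference semantics in a few lines. This is only a sketch: the function names mirror the mnemonics, while the rounding used by AV (integer floor division) and the restriction of ME to an odd operand count are assumptions, since the text does not specify either.

```python
def av(*operands):
    """AV: average of three or more operands (floor division is an assumption)."""
    if len(operands) < 3:
        raise ValueError("AV is described for three or more operands")
    return sum(operands) // len(operands)


def me(*operands):
    """ME: median of three, five, ... operands (odd count is an assumption)."""
    if len(operands) < 3 or len(operands) % 2 == 0:
        raise ValueError("ME is sketched here for an odd count of at least three")
    return sorted(operands)[len(operands) // 2]


def so(r1, r2):
    """SO: sort two operands; returns (bigger, smaller) as two separate results."""
    return (r1, r2) if r1 >= r2 else (r2, r1)
```

Each of these needs either more than two source operands or more than one result, which is why a single conventional issue slot with two operand fields and one target field cannot express them.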

In a conventional VLIW processor, the above operations
require execution of several commands in sequence. The
registers in the multiport register file are used in some cases 
to represent a combination of a set of small numbers. For 
example, if the registers are 64-bit wide, four 16-bit numbers 
could be represented per register. In this case, each of these 
numbers may be operated upon separately. For example, in 
response to an ADD command a functional unit may add 
four pairs of numbers from two registers. 
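
The packed ADD described above can be modelled as follows. This is a minimal sketch under stated assumptions: lane 0 is taken to be the least significant 16 bits, and each sum wraps around modulo 2^16; the text fixes neither choice, and the helper names are hypothetical.

```python
LANES = 4                      # four numbers per register
LANE_BITS = 16                 # each number is 16 bits wide
LANE_MASK = (1 << LANE_BITS) - 1


def unpack(reg):
    """Split a 64-bit register value into four 16-bit lanes (lane 0 lowest)."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def pack(lanes):
    """Combine four 16-bit lanes into one 64-bit register value."""
    reg = 0
    for i, lane in enumerate(lanes):
        reg |= (lane & LANE_MASK) << (i * LANE_BITS)
    return reg


def packed_add(ra, rb):
    """ADD on packed registers: four independent sums (wraparound assumed)."""
    return pack([(a + b) & LANE_MASK for a, b in zip(unpack(ra), unpack(rb))])
```

No carry crosses from one lane into the next, so the four pairs are added independently even though only two register operands are read.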

This approach can be used for super units as well. For
example, four registers can represent a 4x4 matrix of 16-bit
numbers. Each register contains a respective quadruplet:
R1=(a11, a12, a13, a14), R2=(a21, a22, a23, a24), R3=(a31,
a32, a33, a34), R4=(a41, a42, a43, a44), representing a
respective one of the rows of the matrix. A set of components
stored in different registers but in the same position of the
corresponding quadruplet represents a column of the matrix.
In a transposition operation, the components of different
rows but in the same position are placed together in a
register RESULT1=(a11, a21, a31, a41), and RESULT2=
(a12, a22, a32, a42). A super unit for transposing matrices
could use two issue slots and generate two rows of a 4x4
matrix. By providing two commands for such a functional
unit, one for producing the two top rows of the transposed
matrix and one for providing the two bottom rows,
transposition is obtained very quickly.
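
The two-command transposition can be sketched as below. The sketch assumes each register is modelled as a 4-tuple and that a command selects a pair of adjacent columns; `transpose_pair` and its `first_column` parameter are hypothetical names, not taken from the text.

```python
def transpose_pair(r1, r2, r3, r4, first_column):
    """One transpose command: four row operands in, two result rows out.

    Four operand fields and two target fields are exactly what two
    combined conventional issue slots provide.
    """
    rows = (r1, r2, r3, r4)
    result1 = tuple(row[first_column] for row in rows)
    result2 = tuple(row[first_column + 1] for row in rows)
    return result1, result2


def transpose_4x4(r1, r2, r3, r4):
    """Full transposition using two commands: top rows, then bottom rows."""
    top = transpose_pair(r1, r2, r3, r4, 0)
    bottom = transpose_pair(r1, r2, r3, r4, 2)
    return top + bottom
```

With rows (11, 12, 13, 14) through (41, 42, 43, 44), the first command yields (11, 21, 31, 41) and (12, 22, 32, 42), matching RESULT1 and RESULT2 in the description.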

A similar operation is a shuffle operation: 

SH R1,R2,R3→R4 (,R5)

This operation permutes and/or selects numbers stored in
registers R1 and R2 according to a permutation defined in
register R3 and causes the numbers to be stored in permuted
order in register R4 and optionally in register R5.
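
A possible reading of the shuffle operation is sketched below. The encoding of the permutation is not given in the text; the sketch assumes, purely for illustration, that R3 supplies a sequence of indices 0 to 7 into the eight numbers formed by R1 followed by R2, that the first four indices fill R4, and that the next four fill the optional R5.

```python
def shuffle(r1, r2, r3, want_r5=False):
    """SH sketch: select numbers from R1 and R2 by the indices in R3.

    r1, r2 : 4-tuples of numbers (the packed register contents)
    r3     : sequence of indices 0..7 (hypothetical encoding)
    Returns (R4,) or (R4, R5).
    """
    source = tuple(r1) + tuple(r2)
    r4 = tuple(source[i] for i in r3[:4])
    if not want_r5:
        return (r4,)
    r5 = tuple(source[i] for i in r3[4:8])
    return r4, r5
```

Three operands and up to two results again exceed what one conventional slot can name, which is the reason such an operation belongs on a super unit.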

In some cases, one or more operands have standard 
values. In these cases it is advantageous to define an 
additional command, which fits in a single issue slot. In this 
additional command, the opcode defines the particular 
operation and the standard value of one or more of the 
operands. The standard value may be defined implicitly. The 
additional command contains operand references only to the 
remaining operands. Such a command can be used in a 
single issue slot of an instruction word having either the first 
or the second format. When decode and launch circuit 12 
encounters such an instruction it supplies the standard 
arguments to the super unit itself. Thus, the super unit can 
receive a command both using one issue slot and using two
or more issue slots. In the former case, the standard values 
are used and a greater number of commands can be included 
in the instruction word. 
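
The expansion of such a single-slot command might look as follows. Everything named here is hypothetical: the opcodes `SH_REVERSE` and `SH_SECOND`, the table, and the index encoding of the permutation are illustrative assumptions, chosen only to show the decode stage supplying an operand that the command itself does not name.

```python
# Hypothetical single-slot opcodes and the standard permutation each implies.
STANDARD_PERMUTATIONS = {
    "SH_REVERSE": (3, 2, 1, 0),   # reverse the four numbers of the first operand
    "SH_SECOND": (4, 5, 6, 7),    # copy the four numbers of the second operand
}


def expand_short_form(opcode, src1, src2, target):
    """Decode-stage expansion of a single-slot command for a super unit.

    The command names only two operands and one target, which fits one
    conventional slot; the standard value is filled in from the opcode.
    """
    permutation = STANDARD_PERMUTATIONS[opcode]
    return {"unit": "SH", "operands": (src1, src2, permutation), "targets": (target,)}
```

Because the short form occupies one slot, the remaining slots of the same instruction word stay free for other commands.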

In the embodiment of FIG. 1, each super unit uses all
fields of the issue slots associated with an integer number of
groups of functional units. An alternative super unit may use
some, but not all, of the fields of the issue slots. For example,
such an alternative unit may process three operands, two of
which stem from the fields of one particular issue slot and a
third one of which comes from another issue slot. When such
an alternative unit is used, the other fields of the relevant
issue slots can be made available for other units that can start
executing in parallel with the alternative super unit. These
other functional units may have, for example, only one
operand or no operand at all, or may produce no result.
These other units would use only some of the fields that the
alternative unit leaves unused. Also, these other functional
units might be alternative super functional units themselves,
using some fields of the issue slot in addition to fields of
another issue slot. However, the use of alternative functional
units imposes complex restrictions on the combinations of
units that can receive commands from a single instruction
word. By using all fields of the issue slots, or at least by not
using remaining fields in partly assigned slots, such
constraints are avoided. This enables utilizing a higher degree of
parallelism and it makes compilation of the instruction
words much easier.

A compiler generates the instruction words for the VLIW
processor. The compiler describes a program in terms of a
number of commands with data dependencies between the
commands. The compiler searches for a way of placing all
commands in a set of instruction words. The compiler
performs a minimization of the number of instruction words
that need to be executed sequentially during execution of the
program. FIG. 4 is a diagram of a flow chart for a method of
compiling programs. In a first step 40, a set of operations is
received together with a specification of data dependencies
between operations. Subsequently, the compiler starts
searching for a way of placing commands for the operations
in a set of instruction words. Second step 42 tests whether
commands have been placed for all the operations received.
If so, the compilation process is completed. If not, third step
44 selects an operation for which no command has yet been
placed and for which preceding "source operations", which
produce its operands, have already been placed.
Furthermore, the earliest instruction word is selected from
the set of instruction words after the instruction words in
which commands for the source operations have been
placed. Fourth step 46 tests whether it is possible to
construct an instruction word which contains the commands
already included in that earliest instruction word plus a
command for the selected operation. Step 46 takes into
account the nature of the commands and the grouping of the
functional units. It is tested whether it is possible to both
place the commands in different groups; and
place commands for super functional units so that no
other commands use issue slots for the groups
associated with those super functional units.
If this is possible, the selected instruction word is updated
and the method returns to second step 42. If this is not
possible, a fifth step 48 is executed in which an instruction
word subsequent to the selected instruction word is selected
and fourth step 46 is repeated.
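
The loop of steps 40 to 48 can be sketched as a simple greedy scheduler. To keep the sketch short, it assumes each operation is already bound to the set of issue slots its command occupies (one slot for a conventional unit, two or more for a super unit), so step 46 reduces to a disjointness test; the function name and data layout are hypothetical.

```python
def schedule(operations, deps):
    """Greedy placement following the flow of FIG. 4.

    operations : dict, name -> frozenset of issue slots the command occupies
    deps       : dict, name -> set of source operation names
    Returns a list of instruction words, each a dict name -> slots.
    """
    words = []      # instruction words in program order
    placed = {}     # operation name -> index of its instruction word
    while len(placed) < len(operations):                 # step 42
        # Step 44: pick an unplaced operation whose sources are all placed.
        ready = [op for op in operations
                 if op not in placed
                 and all(src in placed for src in deps.get(op, ()))]
        if not ready:
            raise ValueError("cyclic data dependencies")
        op = ready[0]
        # Earliest word after every word that holds a source operation.
        index = max((placed[src] + 1 for src in deps.get(op, ())), default=0)
        while True:
            if index == len(words):
                words.append({})
            used = set().union(*words[index].values())
            if not (operations[op] & used):              # step 46
                words[index][op] = operations[op]
                placed[op] = index
                break
            index += 1                                   # step 48
    return words
```

A command for a super unit that occupies slots 0 and 1 therefore cannot share a word with any command bound to slot 0 or slot 1, which is the second condition tested in step 46.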

FIG. 5 is a diagram to explain fourth step 46 further. On
the left, a number of operations is shown as first nodes
50a-d. On the right, a number of issue slots is shown as
second nodes 52a-e. The task of the fourth step is to test
whether there is a mapping of the first nodes 50a-d to the
second nodes 52a-e. In this mapping an operation for a
super functional unit 50d maps to two or more of the issue
slots 52a-e. The other first nodes 50a-c correspond to
conventional operations and each maps to a single respective one of
second nodes 52a-e. Each of nodes 52a-e corresponds to an
issue slot associated with a group that contains a functional
unit capable of executing the relevant operation. Of course,
the flow chart of FIG. 4 is but a simplified example. In
general, minimization is performed under constraints of data
dependencies between commands (i.e., if a first command
uses as input a result from a second command, these
commands should be placed in different instruction words,
the instruction word that contains the first command
following the instruction word that contains the second command).
Moreover, the minimization is performed under the
constraint that the functional units are capable of starting
execution of all commands in parallel for each instruction
word.
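
The mapping test of FIG. 5 can be sketched as a small backtracking search. The sketch assumes each operation comes with a list of alternative slot sets it could occupy (single slots for conventional operations, sets of two or more for a super functional unit); `find_mapping` and this data layout are hypothetical.

```python
def find_mapping(candidates, used=frozenset()):
    """Search for an assignment of operations to disjoint issue slots.

    candidates : list of (operation, [frozenset of slots, ...]) pairs
    Returns a dict operation -> chosen slots, or None if no mapping exists.
    """
    if not candidates:
        return {}
    (op, options), rest = candidates[0], candidates[1:]
    for slots in options:
        if slots & used:
            continue                    # a slot is already taken
        mapping = find_mapping(rest, used | slots)
        if mapping is not None:
            mapping[op] = slots
            return mapping
    return None                         # no consistent choice for this operation
```

If the search returns None, the selected operation does not fit in the instruction word under test and the scheduler moves on to a subsequent word.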
We claim: 

1. A method of compiling instructions for a VLIW 
processor, the processor containing multiple groups of func- 
tional units, each particular one of the functional units being 
associated with a single one of the groups, and also com- 
prising at least one super functional unit associated with at 
least two of the groups, each very long instruction word 
being allowed to contain at most one specific command for 
each specific one of the groups, each very long instruction 
word which contains a command for the super functional 
unit not having any commands for any of the functional units 
in said at least two of the groups, the method comprising: 

receiving a set of commands that are to be executed by the 
functional units; 
searching for consistent assignments of the commands to
the very long instruction words, and, upon finding a
first one of the commands for the super functional unit
and at least a second one of the commands for any one
of the functional units from said at least two of the
groups, assigning the first command and the second
command to different ones of the very long instruction
words.

2. A machine readable medium comprising a program for
executing a method of compiling instructions for a VLIW
processor, each instruction comprising a very long
instruction word, the processor containing multiple groups of
functional units, each particular one of the functional units
being associated with a single one of the groups, and at least
one super functional unit associated with at least two of said
groups, each very long instruction word being allowed to
contain at most one specific command for each specific one
of the groups, each very long instruction word which
contains a command for the super functional unit not being
allowed to contain commands for any of the functional units
in said at least two of the groups, the method comprising:

receiving a set of commands that are to be executed by the
functional units;
searching for consistent assignments of the commands to
the very long instruction words, and, upon finding a
first one of the commands for the super functional unit
and at least a second one of the commands for any one
of the functional units from said at least two of the
groups, assigning the first command and the second
command to different ones of the very long instruction
words.

* * * * *